2025-03-18
Regression Models for Count Outcomes
Chapters 24-26 of the Course Notes describe this material, along with hurdle models (next class), tobit regression, and some additional issues with certain types of count models.
countreg and topmodels packages

To build rootograms to visualize the results of regression models on count outcomes, I have decided for the moment to continue to use the countreg and topmodels packages, which are currently available only on R-Forge, so they must be installed from that repository rather than from CRAN.
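As a sketch, the usual R-Forge installation commands look like this (run once, in the R Console within RStudio, not inside a Quarto chunk; the exact commands used in class may differ slightly):

```r
## install countreg and topmodels from R-Forge (they are not on CRAN)
install.packages("countreg", repos = "https://R-Forge.R-project.org")
install.packages("topmodels", repos = "https://R-Forge.R-project.org")
```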
knitr::opts_chunk$set(comment=NA)
library(janitor); library(gt); library(broom)
library(mosaic); library(Hmisc); library(patchwork)
library(rsample); library(yardstick); library(here)
library(conflicted) ## resolve conflicts
library(topmodels) ## for rootograms
library(MASS) ## for glm.nb to fit NB models
library(pscl) ## for zero-inflated and hurdle fits
library(lmtest) ## for Vuong test
library(easystats)
library(tidyverse)
conflicts_prefer(dplyr::select(), dplyr::filter(), base::max(),
base::sum(), yardstick::rmse(), yardstick::mae(),
pscl::zeroinfl())
theme_set(theme_bw())

We want to build a generalized linear model to predict count data using one or more predictors.
Count data are non-negative integers (0, 1, 2, 3, …)
We’ll use the Poisson and Negative Binomial probability distributions.
The Poisson probability model describes the probability of a given number of events occurring in a fixed interval of time or space.
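As a quick illustration (separate from the medicare analysis), we can compute Poisson probabilities for small counts directly, using an assumed mean rate of 2 events per interval:

```r
## P(Y = k) for k = 0, ..., 4 under a Poisson model with lambda = 2;
## the defining feature of the Poisson is that its mean and variance
## are both equal to lambda
dpois(0:4, lambda = 2)
```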
We will use glm() to fit Poisson models, by specifying family = "poisson".
A Poisson model might fit poorly due to overdispersion, where the variance of Y is larger than we’d expect based on the mean of Y.
If so, we’ll look at other approaches (especially zero-inflated and negative binomial models) to address the overdispersion.
We will demonstrate the use of glm.nb() from the MASS package to fit negative binomial regression models.
We’ll use zeroinfl() from pscl to fit ZIP and ZINB regressions.
A hurdle model predicts the count outcome by making an assumption that there are two processes at work:

- one process determines whether the count is zero or positive (the "hurdle"), and
- a second (truncated count) process determines the size of the count, given that the hurdle is cleared.
We use hurdle() from pscl to fit these.
We’ll build rootograms (with the countreg and topmodels packages) to help us visualize fit, and we’ll use Vuong tests (via the lmtest package) to help us make decisions between various types of Poisson models or various types of Negative Binomial models, on the basis of improvement in fit of things like bias-corrected AIC or BIC.

The medicare example

Source: NMES1988 data in R’s AER package, cleaned up to medicare.csv.
Essentially the same data are used in an example from the University of Virginia on hurdle models.
The data are a cross-section from the US National Medical Expenditure Survey (NMES) conducted in 1987 and 1988. The NMES is based upon a representative, national probability sample of the civilian non-institutionalized population and of individuals admitted to long-term care facilities during 1987.
medicare data

The data are a subsample of individuals ages 66 and over, all of whom are covered by Medicare (a public insurance program providing substantial protection against health-care costs), and some of whom also have private supplemental insurance.
medicare code book

| Variable | Description |
|---|---|
| subject | subject number (code) |
| visits | outcome: # of physician office visits |
| hospital | # of hospital stays |
| health | self-rated health (poor, average, excellent) |
| chronic | # of chronic conditions |
| sex | male or female |
| school | years of education |
| insurance | subject (also) has private insurance? (yes/no) |
Predict visits using main effects of the 6 predictors (excluding subject)
medicare tibble

# A tibble: 4,406 × 7
visits hospital health chronic sex school insurance
<dbl> <dbl> <fct> <dbl> <fct> <dbl> <fct>
1 5 1 average 2 male 6 yes
2 1 0 average 2 female 10 yes
3 13 3 poor 4 female 10 no
4 16 1 poor 2 male 3 yes
5 3 0 average 2 female 6 yes
6 17 0 poor 5 female 7 no
7 9 0 average 0 female 8 yes
8 3 0 average 0 female 8 yes
9 1 0 average 0 female 8 yes
10 0 0 average 0 female 8 yes
# ℹ 4,396 more rows
medicare

 visits hospital health chronic
Min. : 0.000 Min. :0.000 average :3509 Min. :0.000
1st Qu.: 1.000 1st Qu.:0.000 poor : 554 1st Qu.:1.000
Median : 4.000 Median :0.000 excellent: 343 Median :1.000
Mean : 5.774 Mean :0.296 Mean :1.542
3rd Qu.: 8.000 3rd Qu.:0.000 3rd Qu.:2.000
Max. :89.000 Max. :8.000 Max. :8.000
sex school insurance
male :1778 Min. : 0.00 yes:3421
female:2628 1st Qu.: 8.00 no : 985
Median :11.00
Mean :10.29
3rd Qu.:12.00
Max. :18.00
insurance

I want No first, then Yes, when building models.
visits min Q1 median Q3 max mean sd n missing
0 1 4 8 89 5.774399 6.759225 4406 0
medicare$visits
n missing distinct Info Mean pMedian Gmd .05
4406 0 60 0.992 5.774 4.5 6.227 0
.10 .25 .50 .75 .90 .95
0 1 4 8 13 17
lowest : 0 1 2 3 4, highest: 63 65 66 68 89
Creating Training and Testing Samples with rsample functions…
I’ve held out 25% of the medicare data for the test sample.
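A sketch of how such a split might be created with rsample. The name med_train matches the model calls that follow; the seed and the names med_split and med_test are my assumptions, since the splitting chunk itself was not shown:

```r
library(rsample)

set.seed(432)  # assumed seed; the actual class seed was not shown
med_split <- initial_split(medicare, prop = 0.75)
med_train <- training(med_split)  # 75% for model fitting
med_test  <- testing(med_split)   # 25% held out for validation
```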
Predict visits using some combination of these 6 predictors…
| Predictor | Description |
|---|---|
| hospital | # of hospital stays |
| health | self-rated health (poor, average, excellent) |
| chronic | # of chronic conditions |
| sex | male or female |
| school | years of education |
| insurance | subject (also) has private insurance? (yes/no) |
We’ll build separate training and test samples to help us validate.
mod_1: A Poisson Regression

Assume our count data (visits) follows a Poisson distribution with a mean conditional on our predictors.
The Poisson model uses a logarithm as its link function, so the model is actually predicting log(visits).
Note that we’re fitting the model here using the training sample alone.
mod_1 Summary
Call:
glm(formula = visits ~ hospital + health + chronic + sex + school +
insurance, family = "poisson", data = med_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.886576 0.028813 30.770 < 2e-16 ***
hospital 0.163555 0.006710 24.374 < 2e-16 ***
healthpoor 0.309610 0.020244 15.294 < 2e-16 ***
healthexcellent -0.358758 0.034875 -10.287 < 2e-16 ***
chronic 0.137349 0.005266 26.082 < 2e-16 ***
sexfemale 0.098325 0.014805 6.641 3.11e-11 ***
school 0.031258 0.002111 14.808 < 2e-16 ***
insuranceyes 0.200249 0.019484 10.278 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 20618 on 3303 degrees of freedom
Residual deviance: 17598 on 3296 degrees of freedom
AIC: 27232
Number of Fisher Scoring iterations: 5
mod_1 (Poisson) model coefficients

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.887 | 0.029 | 30.770 | 0.000 |
| hospital | 0.164 | 0.007 | 24.374 | 0.000 |
| healthpoor | 0.310 | 0.020 | 15.294 | 0.000 |
| healthexcellent | −0.359 | 0.035 | −10.287 | 0.000 |
| chronic | 0.137 | 0.005 | 26.082 | 0.000 |
| sexfemale | 0.098 | 0.015 | 6.641 | 0.000 |
| school | 0.031 | 0.002 | 14.808 | 0.000 |
| insuranceyes | 0.200 | 0.019 | 10.278 | 0.000 |
Harry and Larry have the same values for all other predictors but only Harry has private insurance. mod_1 estimates Harry’s log(visits) to be 0.2 larger than Larry’s log(visits).
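To turn that difference on the log scale into a multiplicative effect on visits, exponentiate the coefficient:

```r
## Harry's predicted visits divided by Larry's predicted visits,
## using the insuranceyes coefficient from mod_1
exp(0.200249)  # about 1.22, i.e., roughly 22% more visits with insurance
```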
OK, you ran a regression/fit a linear model and some of your variables are log-transformed.
Source is here for further reference (also see today’s README.)
Only the dependent/response variable is log-transformed: Exponentiate the coefficient. This gives the multiplicative factor for every one-unit increase in the independent variable.
from this source
Only the independent/predictor variable is log-transformed: Divide the coefficient by 100. This tells us that a 1% increase in the independent variable increases (or decreases) the dependent variable by (coefficient/100) units.
Example: the coefficient is 0.198. 0.198/100 = 0.00198. For every 1% increase in the independent variable, our dependent variable increases by about 0.002.
For x percent increase, multiply the coefficient by log(1.x).
Example: For every 10% increase in the independent variable, our dependent variable increases by about 0.198 * log(1.10) = 0.02
from this source
Both the dependent and independent variables are log-transformed: Interpret the coefficient as the percent increase in the dependent variable for every 1% increase in the independent variable.
from this source
mod_1 parameters

Parameter | IRR | SE | 90% CI | z | p
---------------------------------------------------------------------
(Intercept) | 2.43 | 0.07 | [2.31, 2.54] | 30.77 | < .001
hospital | 1.18 | 7.90e-03 | [1.16, 1.19] | 24.37 | < .001
health [poor] | 1.36 | 0.03 | [1.32, 1.41] | 15.29 | < .001
health [excellent] | 0.70 | 0.02 | [0.66, 0.74] | -10.29 | < .001
chronic | 1.15 | 6.04e-03 | [1.14, 1.16] | 26.08 | < .001
sex [female] | 1.10 | 0.02 | [1.08, 1.13] | 6.64 | < .001
school | 1.03 | 2.18e-03 | [1.03, 1.04] | 14.81 | < .001
insurance [yes] | 1.22 | 0.02 | [1.18, 1.26] | 10.28 | < .001
Having private insurance (holding the other predictors constant) is associated with a 22% increase in our outcome, visits.

mod_1 performance summaries

# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------
27231.869 | 27231.912 | 27280.692 | 0.600 | 6.594 | 1.000 | -4.119 | 0.013
See the next slide for details on how to interpret this…
Our Poisson model (mod_1) doesn’t fit enough zeros or ones, fits too many values in the 3-12 range, and then not enough of the higher values.

mod_1 (plots 1-2)

mod_1 (plots 3-4)

mod_1 (plots 5-6)

mod_1 Predictions

We’ll use the augment function to store the predictions within our training sample. Note the use of "response" to predict visits, not log(visits).
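A sketch of that augment() call for mod_1, following the same pattern shown later for mod_2 (this assumes mod_1 has already been fit to med_train):

```r
library(broom)

## store response-scale predictions (counts, not log-counts) in .fitted
mod_1_aug <- augment(mod_1, med_train, type.predict = "response")
```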
mod_1 Fit

Within our training sample, mod_1_aug now contains both the actual counts (visits) and the predicted counts (in .fitted) from mod_1. We’ll summarize the fit…
mets <- metric_set(rsq, rmse, mae)
mod_1_summary <-
mets(mod_1_aug, truth = visits, estimate = .fitted) |>
mutate(model = "mod_1") |> relocate(model)
mod_1_summary |> gt() |> fmt_number(decimals = 3)

| model | .metric | .estimator | .estimate |
|---|---|---|---|
| mod_1 | rsq | standard | 0.100 |
| mod_1 | rmse | standard | 6.594 |
| mod_1 | mae | standard | 4.189 |
These will become interesting as we build additional models.
mod_2: A Negative Binomial Regression

The negative binomial model requires the estimation of an additional parameter, called \(\theta\) (theta). The default link for this generalized linear model is also a logarithm, like the Poisson.
The estimated dispersion parameter value \(\theta\) here is 1.2111, as shown in the model summary.
The Poisson model is the limiting case of the negative binomial model as \(\theta \rightarrow \infty\): the negative binomial variance \(\mu + \mu^2/\theta\) then reduces to the Poisson variance \(\mu\).
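A small illustration of the extra dispersion the NB model allows, using glm.nb()'s parameterization (variance \(\mu + \mu^2/\theta\)), the overall mean of visits (5.77), and the estimated \(\theta\) of 1.2111 reported in the model summary:

```r
## variance implied by a negative binomial model with mean mu and dispersion theta
nb_var <- function(mu, theta) mu + mu^2 / theta

nb_var(5.77, 1.2111)  # about 33.3, versus a Poisson variance of only 5.77
```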
mod_2 summary
Call:
MASS::glm.nb(formula = visits ~ hospital + health + chronic +
sex + school + insurance, data = med_train, init.theta = 1.211089878,
link = log)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.765746 0.065602 11.673 < 2e-16 ***
hospital 0.224205 0.022745 9.857 < 2e-16 ***
healthpoor 0.360067 0.055608 6.475 9.47e-11 ***
healthexcellent -0.335591 0.070353 -4.770 1.84e-06 ***
chronic 0.169070 0.013887 12.174 < 2e-16 ***
sexfemale 0.109443 0.035920 3.047 0.00231 **
school 0.030763 0.005037 6.107 1.02e-09 ***
insuranceyes 0.237080 0.045780 5.179 2.23e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for Negative Binomial(1.2111) family taken to be 1)
Null deviance: 4341.7 on 3303 degrees of freedom
Residual deviance: 3783.1 on 3296 degrees of freedom
AIC: 18328
Number of Fisher Scoring iterations: 1
Theta: 1.2111
Std. Err.: 0.0388
2 x log-likelihood: -18309.8280
mod_2 (NB) coefficients

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.766 | 0.066 | 11.673 | 0.000 |
| hospital | 0.224 | 0.023 | 9.857 | 0.000 |
| healthpoor | 0.360 | 0.056 | 6.475 | 0.000 |
| healthexcellent | −0.336 | 0.070 | −4.770 | 0.000 |
| chronic | 0.169 | 0.014 | 12.174 | 0.000 |
| sexfemale | 0.109 | 0.036 | 3.047 | 0.002 |
| school | 0.031 | 0.005 | 6.107 | 0.000 |
| insuranceyes | 0.237 | 0.046 | 5.179 | 0.000 |
mod_2 parameters

Parameter | IRR | SE | 90% CI | z | p
--------------------------------------------------------------------
(Intercept) | 2.15 | 0.14 | [1.93, 2.40] | 11.67 | < .001
hospital | 1.25 | 0.03 | [1.20, 1.30] | 9.86 | < .001
health [poor] | 1.43 | 0.08 | [1.31, 1.57] | 6.48 | < .001
health [excellent] | 0.71 | 0.05 | [0.64, 0.80] | -4.77 | < .001
chronic | 1.18 | 0.02 | [1.16, 1.21] | 12.17 | < .001
sex [female] | 1.12 | 0.04 | [1.05, 1.18] | 3.05 | 0.002
school | 1.03 | 5.19e-03 | [1.02, 1.04] | 6.11 | < .001
insurance [yes] | 1.27 | 0.06 | [1.17, 1.37] | 5.18 | < .001
mod_2 performance summaries

# Indices of model performance
AIC | AICc | BIC | Nagelkerke's R2 | RMSE | Sigma | Score_log | Score_spherical
-------------------------------------------------------------------------------------------------
18327.828 | 18327.883 | 18382.754 | 0.213 | 6.941 | 1.000 | -2.938 | 0.015
Does this look better than the Poisson rootogram?
mod_2 (plots 1-2)

mod_2 (plots 3-4)

mod_2 (plots 5-6)

mod_2 Predictions

mod_2_aug <- augment(mod_2, med_train, type.predict = "response")
mod_2_aug |> select(subject, visits, .fitted) |> head(3)

# A tibble: 3 × 3
subject visits .fitted
<chr> <dbl> <dbl>
1 355 19 5.22
2 2661 3 4.08
3 2895 0 4.39
Note that augment() generates a warning when applied to negbin models. I’d silence it, as I have here.

mod_2 Fit

mod_2_aug has actual (visits) and predicted counts (in .fitted).
The reasonable in-sample summaries are the impressions from the rootograms, along with the fit statistics we’ve prepared so far.
| Model | Rootogram impressions |
|---|---|
| mod_1 (P) | Many problems. Data appear overdispersed. |
| mod_2 (NB) | Still not enough zeros; some big predictions. |
mod_3: Zero-Inflated Poisson (ZIP) Model

The zero-inflated Poisson model describes count data with an excess of zero counts.
The model posits that there are two processes involved:

- a logistic regression model predicts whether a subject is a "structural" (excess) zero, and
- a Poisson model describes the counts for the remaining subjects.
We’ll use the pscl package to fit zero-inflated models.
mod_3 ZIP coefficients

Sadly, there are no broom tidying functions for these zero-inflated models.
Call:
pscl::zeroinfl(formula = visits ~ hospital + health + chronic + sex +
school + insurance, data = med_train)
Pearson residuals:
Min 1Q Median 3Q Max
-5.4401 -1.1618 -0.4769 0.5699 24.3630
Count model coefficients (poisson with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.298380 0.029608 43.853 < 2e-16 ***
hospital 0.160449 0.006780 23.664 < 2e-16 ***
healthpoor 0.302326 0.020022 15.099 < 2e-16 ***
healthexcellent -0.281826 0.035775 -7.878 3.33e-15 ***
chronic 0.097090 0.005420 17.913 < 2e-16 ***
sexfemale 0.056219 0.014934 3.765 0.000167 ***
school 0.023367 0.002139 10.924 < 2e-16 ***
insuranceyes 0.093169 0.019794 4.707 2.52e-06 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.25623 0.16807 1.525 0.127379
hospital -0.29538 0.10252 -2.881 0.003962 **
healthpoor -0.09642 0.19067 -0.506 0.613089
healthexcellent 0.32851 0.16898 1.944 0.051891 .
chronic -0.49292 0.05183 -9.510 < 2e-16 ***
sexfemale -0.36437 0.10295 -3.539 0.000401 ***
school -0.06165 0.01410 -4.373 1.22e-05 ***
insuranceyes -0.66566 0.11996 -5.549 2.87e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Number of iterations in BFGS optimization: 22
Log-likelihood: -1.222e+04 on 16 Df
mod_3 parameters

# Fixed Effects
Parameter | Log-Mean | SE | 90% CI | z | p
--------------------------------------------------------------------------
(Intercept) | 1.30 | 0.03 | [ 1.25, 1.35] | 43.85 | < .001
hospital | 0.16 | 6.78e-03 | [ 0.15, 0.17] | 23.66 | < .001
health [poor] | 0.30 | 0.02 | [ 0.27, 0.34] | 15.10 | < .001
health [excellent] | -0.28 | 0.04 | [-0.34, -0.22] | -7.88 | < .001
chronic | 0.10 | 5.42e-03 | [ 0.09, 0.11] | 17.91 | < .001
sex [female] | 0.06 | 0.01 | [ 0.03, 0.08] | 3.76 | < .001
school | 0.02 | 2.14e-03 | [ 0.02, 0.03] | 10.92 | < .001
insurance [yes] | 0.09 | 0.02 | [ 0.06, 0.13] | 4.71 | < .001
# Zero-Inflation
Parameter | Log-Odds | SE | 90% CI | z | p
----------------------------------------------------------------------
(Intercept) | 0.26 | 0.17 | [-0.02, 0.53] | 1.52 | 0.127
hospital | -0.30 | 0.10 | [-0.46, -0.13] | -2.88 | 0.004
health [poor] | -0.10 | 0.19 | [-0.41, 0.22] | -0.51 | 0.613
health [excellent] | 0.33 | 0.17 | [ 0.05, 0.61] | 1.94 | 0.052
chronic | -0.49 | 0.05 | [-0.58, -0.41] | -9.51 | < .001
sex [female] | -0.36 | 0.10 | [-0.53, -0.20] | -3.54 | < .001
school | -0.06 | 0.01 | [-0.08, -0.04] | -4.37 | < .001
insurance [yes] | -0.67 | 0.12 | [-0.86, -0.47] | -5.55 | < .001
mod_3 performance summaries

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------------------
24467.627 | 24467.792 | 24565.273 | 0.656 | 0.656 | 6.560 | 6.576
AIC | Score_log | Score_spherical
---------------------------------------
24467.627 | -3.698 | 0.013
There is also no glance() available for the mod_3 model.

mod_3 Predictions

We have no augment or other broom functions available for zero-inflated models, so …
mod_3_aug <- med_train |>
mutate(".fitted" = predict(mod_3, type = "response"),
".resid" = resid(mod_3, type = "response"))
mod_3_aug |> select(subject, visits, .fitted, .resid) |>
  head(3)

# A tibble: 3 × 4
subject visits .fitted .resid
<chr> <dbl> <dbl> <dbl>
1 355 19 5.31 13.7
2 2661 3 4.21 -1.21
3 2895 0 4.59 -4.59
mod_3 Fit

mod_3_aug now has actual (visits) and predicted counts (in .fitted) from mod_3, just as we set up for the previous two models.
Comparing the models through mod_3

bind_rows(mod_1_summary, mod_2_summary, mod_3_summary) |>
pivot_wider(names_from = model,
values_from = .estimate) |>
gt() |> fmt_number(decimals = 3) |>
tab_options(table.font.size = 20)

| .metric | .estimator | mod_1 | mod_2 | mod_3 |
|---|---|---|---|---|
| rsq | standard | 0.100 | 0.078 | 0.108 |
| rmse | standard | 6.594 | 6.941 | 6.560 |
| mae | standard | 4.189 | 4.252 | 4.164 |
Remember we want a larger \(R^2\) and smaller values of RMSE and MAE.
Vuong’s test compares predicted probabilities (for each count) in two non-nested models. How about Poisson vs. ZIP?
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw -14.59671 model2 > model1 < 2.22e-16
AIC-corrected -14.51271 model2 > model1 < 2.22e-16
BIC-corrected -14.25638 model2 > model1 < 2.22e-16
The large negative z-statistic indicates mod_3 (ZIP) fits better than mod_1 (Poisson) in our training sample.
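The output above can be produced with pscl's vuong() function, which compares the models in the order given; a sketch, assuming mod_1 and mod_3 have been fit as shown earlier:

```r
## Vuong test: mod_1 (Poisson) enters as model1, mod_3 (ZIP) as model2,
## so negative z-statistics favor the ZIP model
pscl::vuong(mod_1, mod_3)
```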
mod_4: Zero-Inflated Negative Binomial (ZINB) Model

As in the ZIP, we assume there are two processes involved:

- a logistic regression model predicts whether a subject is a "structural" (excess) zero, and
- a negative binomial model describes the counts for the remaining subjects.
We’ll use the pscl package again and the zeroinfl function.
mod_4 summary
Call:
zeroinfl(formula = visits ~ hospital + health + chronic + sex + school +
insurance, data = med_train, dist = "negbin")
Pearson residuals:
Min 1Q Median 3Q Max
-1.1943 -0.7072 -0.2773 0.3347 17.2775
Count model coefficients (negbin with log link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.059477 0.070188 15.095 < 2e-16 ***
hospital 0.207087 0.023460 8.827 < 2e-16 ***
healthpoor 0.333854 0.052720 6.333 2.41e-10 ***
healthexcellent -0.288168 0.073627 -3.914 9.08e-05 ***
chronic 0.126107 0.013737 9.180 < 2e-16 ***
sexfemale 0.069573 0.035975 1.934 0.05312 .
school 0.025015 0.004983 5.020 5.18e-07 ***
insuranceyes 0.151514 0.048328 3.135 0.00172 **
Log(theta) 0.389048 0.040783 9.539 < 2e-16 ***
Zero-inflation model coefficients (binomial with logit link):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.45740 0.31049 1.473 0.14071
hospital -0.84887 0.57525 -1.476 0.14004
healthpoor -0.33729 0.70242 -0.480 0.63110
healthexcellent 0.27654 0.33753 0.819 0.41262
chronic -1.16806 0.20536 -5.688 1.29e-08 ***
sexfemale -0.56382 0.24204 -2.329 0.01983 *
school -0.09233 0.03157 -2.925 0.00345 **
insuranceyes -1.00819 0.27195 -3.707 0.00021 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Theta = 1.4756
Number of iterations in BFGS optimization: 48
Log-likelihood: -9102 on 17 Df
mod_4 parameters

# Fixed Effects
Parameter | Log-Mean | SE | 90% CI | z | p
--------------------------------------------------------------------------
(Intercept) | 1.06 | 0.07 | [ 0.94, 1.17] | 15.09 | < .001
hospital | 0.21 | 0.02 | [ 0.17, 0.25] | 8.83 | < .001
health [poor] | 0.33 | 0.05 | [ 0.25, 0.42] | 6.33 | < .001
health [excellent] | -0.29 | 0.07 | [-0.41, -0.17] | -3.91 | < .001
chronic | 0.13 | 0.01 | [ 0.10, 0.15] | 9.18 | < .001
sex [female] | 0.07 | 0.04 | [ 0.01, 0.13] | 1.93 | 0.053
school | 0.03 | 4.98e-03 | [ 0.02, 0.03] | 5.02 | < .001
insurance [yes] | 0.15 | 0.05 | [ 0.07, 0.23] | 3.14 | 0.002
# Zero-Inflation
Parameter | Log-Odds | SE | 90% CI | z | p
----------------------------------------------------------------------
(Intercept) | 0.46 | 0.31 | [-0.05, 0.97] | 1.47 | 0.141
hospital | -0.85 | 0.58 | [-1.80, 0.10] | -1.48 | 0.140
health [poor] | -0.34 | 0.70 | [-1.49, 0.82] | -0.48 | 0.631
health [excellent] | 0.28 | 0.34 | [-0.28, 0.83] | 0.82 | 0.413
chronic | -1.17 | 0.21 | [-1.51, -0.83] | -5.69 | < .001
sex [female] | -0.56 | 0.24 | [-0.96, -0.17] | -2.33 | 0.020
school | -0.09 | 0.03 | [-0.14, -0.04] | -2.92 | 0.003
insurance [yes] | -1.01 | 0.27 | [-1.46, -0.56] | -3.71 | < .001
mod_4 performance summaries

# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
---------------------------------------------------------------------
18237.904 | 18238.090 | 18341.653 | 0.895 | 0.895 | 6.709 | 6.726
AIC | Score_log | Score_spherical
---------------------------------------
18237.904 | -2.853 | 0.014
There is also no glance() available for the mod_4 model.

mod_4 Predictions

Again, we have no augment or other broom functions available for zero-inflated models, so …
mod_4_aug <- med_train |>
mutate(".fitted" = predict(mod_4, type = "response"),
".resid" = resid(mod_4, type = "response"))
mod_4_aug |> select(subject, visits, .fitted, .resid) |>
  head(3)

# A tibble: 3 × 4
subject visits .fitted .resid
<chr> <dbl> <dbl> <dbl>
1 355 19 5.29 13.7
2 2661 3 4.47 -1.47
3 2895 0 4.57 -4.57
mod_4 Fit

mod_4_aug now has actual (visits) and predicted counts (in .fitted) from mod_4.
Comparing the models through mod_4

bind_rows(mod_1_summary, mod_2_summary,
mod_3_summary, mod_4_summary) |>
pivot_wider(names_from = model,
values_from = .estimate) |>
gt() |> fmt_number(decimals = 3) |>
tab_options(table.font.size = 20)

| .metric | .estimator | mod_1 | mod_2 | mod_3 | mod_4 |
|---|---|---|---|---|---|
| rsq | standard | 0.100 | 0.078 | 0.108 | 0.094 |
| rmse | standard | 6.594 | 6.941 | 6.560 | 6.709 |
| mae | standard | 4.189 | 4.252 | 4.164 | 4.191 |
What do you think?
How about Negative Binomial vs. ZINB?
Vuong Non-Nested Hypothesis Test-Statistic:
(test-statistic is asymptotically distributed N(0,1) under the
null that the models are indistinguishible)
-------------------------------------------------------------
Vuong z-statistic H_A p-value
Raw 4.808304 model1 > model2 7.6108e-07
AIC-corrected 4.082004 model1 > model2 2.2325e-05
BIC-corrected 1.865741 model1 > model2 0.031039
The large positive z-statistics indicate mod_4 (ZINB) fits better than mod_2 (Negative Binomial) in our training sample.
Predict the visit counts for each subject in our test sample.
Combine the various predictions into a tibble with the original data.
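A sketch of how those test-sample predictions might be assembled. The names test_res and pre_m1 through pre_m4 come from the summary code that follows; the name med_test for the test sample is my assumption, since the prediction chunk itself was not shown:

```r
library(dplyr)

## response-scale (count) predictions from each model on the held-out data
test_res <- med_test |>
  mutate(pre_m1 = predict(mod_1, newdata = med_test, type = "response"),
         pre_m2 = predict(mod_2, newdata = med_test, type = "response"),
         pre_m3 = predict(mod_3, newdata = med_test, type = "response"),
         pre_m4 = predict(mod_4, newdata = med_test, type = "response"))
```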
m1_sum <- mets(test_res, truth = visits, estimate = pre_m1) |>
mutate(model = "mod_1")
m2_sum <- mets(test_res, truth = visits, estimate = pre_m2) |>
mutate(model = "mod_2")
m3_sum <- mets(test_res, truth = visits, estimate = pre_m3) |>
mutate(model = "mod_3")
m4_sum <- mets(test_res, truth = visits, estimate = pre_m4) |>
mutate(model = "mod_4")
test_sum <- bind_rows(m1_sum, m2_sum, m3_sum, m4_sum) |>
pivot_wider(names_from = model,
values_from = .estimate)
test_sum |>
select(-.estimator) |>
gt() |> fmt_number(decimals = 3) |>
tab_options(table.font.size = 20)

| .metric | mod_1 | mod_2 | mod_3 | mod_4 |
|---|---|---|---|---|
| rsq | 0.103 | 0.108 | 0.099 | 0.097 |
| rmse | 7.212 | 7.205 | 5.907 | 5.967 |
| mae | 4.455 | 4.450 | 3.994 | 4.009 |
432 Class 17 | 2025-03-18 | https://thomaselove.github.io/432-2025/